Abstract
Model explainability has recently gained increasing attention among machine learning practitioners, especially with the popularization of deep learning frameworks, which further promotes the use of increasingly complicated models to improve accuracy. In reality, however, the model with the highest accuracy may not be the one that gets deployed. Trust is an important factor affecting the adoption of complicated models. In this notebook we give a brief introduction to several popular methods for model explainability, with a focus on hands-on demonstrations of how to actually explain a model across a variety of model families.
This notebook is written with reticulate, a package that allows inter-operation between R and Python.
Why do we need to explain a machine learning model? The benefit of an explainable model over a black-box model is that it can be trusted. Trust matters in many real applications, where the successful deployment of a machine learning model requires the trust of end users. Sometimes trust plays an even bigger role than model accuracy.
Other than trust, model explainability (or interpretability, interchangeably used hereafter) may also guide us in the correct direction to further improve the model.1
In general, a linear model is more interpretable than a non-linear one, but it also tends to suffer from lower accuracy. More advanced, and hence more complicated, models usually have worse interpretability.
One should not confuse model explainability with actual causality. Being able to explain a model doesn’t mean that we can identify any ground-truth causal relation behind it. Model explainability is for, and only for, the model, not for the facts we’d like to model. Nevertheless, understanding how to reason about the model will definitely help us better model the actual pattern behind the scenes.
In this notebook we will walk through 3 popular approaches to model prediction explanation, each of which comes with a dedicated Python package:
An explanation model \(g(x\prime)\) is an interpretable approximation of the original model \(f(x)\). Its sole purpose is to provide the extra explainability that the original model, due to its complexity, fails to offer.
The general idea is to use a simplified input \(x\prime\) such that \(x = h_x(x\prime)\), where \(h_x(\cdot)\) is a mapping function for any given raw input \(x\). Then the interpretable approximation can be written as:
\[ g(x\prime) \approx f(h_x(x\prime)). \]
The additive feature attribution methods specify the explanation model of the following form:
\[ g(x\prime) = \phi_0 + \sum_{i = 1}^m \phi_i x_i\prime, \]
where \(m\) is the total number of simplified features and \(x_i\prime \in \{0, 1\}\) is simply an indicator.2 Apparently, the choice of an additive model is for (linear) interpretability. The simplified features are an interpretable representation of the original model features.
As we will see additivity is the key to explainability. All the approaches we will discuss in this notebook follow this philosophy.
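To make the additive form concrete, here is a tiny numeric sketch. The attribution weights \(\phi_i\) below are hypothetical, not from any fitted model:

```python
import numpy as np

# g(x') = phi_0 + sum_i phi_i * x'_i, with x' a binary indicator vector.
phi_0 = 0.5                        # Intercept: the "base" prediction.
phi = np.array([0.2, -0.3, 0.1])   # Hypothetical per-feature attributions phi_i.
x_prime = np.array([1, 0, 1])      # Features 1 and 3 present, feature 2 absent.

g = phi_0 + phi @ x_prime          # Contributions of present features simply add up.
print(g)                           # close to 0.8 = 0.5 + 0.2 + 0.1
```

Because the model is additive, the contribution of each present feature can be read off directly from its weight.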
One very popular such additive model is LIME (Ribeiro, Singh, and Guestrin (2016)). LIME stands for Local Interpretable Model-Agnostic Explanations. As its full name suggests, LIME can be applied to any machine learning model. LIME achieves prediction-level interpretability by approximating the original model with an explanation model locally around each prediction.
From their original paper:
By “explaining a prediction”, we mean presenting textual or visual artifacts that provide qualitative understanding of the relationship between the instance’s components (e.g. words in text, patches in an image) and the model’s prediction.
The feature space of the original model will in general differ from that of the explanation model. An explanation model uses an interpretable representation of the original features as its training input. Different data types have different interpretable representations.
LIME proposes an explanation model \(g(x\prime)\) with a domain \(\{0, 1\}\). That is, it acts on absence or presence of the interpretable features \(x\prime\). The choice is, obviously, for better interpretability.
Language Data
For text classification problems with human language as the source input, the most straightforward interpretable representation is a binary indicator vector over a bag of words. The explanation model will then try to reason which word or token is driving the prediction in which direction. This holds regardless of the form of the original model features, be it a word count matrix, a term frequency-inverse document frequency (TF-IDF) matrix, or numerical embeddings.
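As a toy sketch (not the lime API), the interpretable representation of a review could be built like this, with a hypothetical mini-vocabulary:

```python
# Hypothetical vocabulary collected from a corpus; the interpretable features.
corpus_vocab = ["awful", "boring", "cast", "great", "movie", "plot"]
doc = "a great movie with a great cast"

# Binary indicator: is each vocabulary word present in the document?
tokens = set(doc.split())
x_prime = [int(w in tokens) for w in corpus_vocab]
print(x_prime)  # [0, 0, 1, 1, 1, 0]
```

Note the indicator is binary: even though "great" appears twice, its entry is simply 1.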
Image Data
For image tasks, the interpretable representation is a joint set of contiguous superpixels that divide the original image into pieces. A superpixel is a group of pixels with similar characteristics. In plain words, just as we segment a sentence into tokens, we segment an image into multiple small pieces and again use a binary vector to indicate the absence or presence of each piece for the later perturbation purpose.
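A minimal sketch of the idea, using a naive 2x2 grid in place of a real superpixel algorithm (the lime package relies on proper image segmentation under the hood):

```python
import numpy as np

img = np.arange(36).reshape(6, 6)   # A toy 6x6 single-channel "image".

# Label each pixel with one of 4 segments (top-left, top-right, bottom-left, bottom-right).
segments = np.repeat(np.repeat(np.arange(4).reshape(2, 2), 3, axis=0), 3, axis=1)

# z' indicates presence/absence per segment; zeroing a segment simulates its absence.
z_prime = np.array([1, 0, 1, 1])    # Drop segment 1 (the top-right block).
perturbed = img * z_prime[segments]
print(perturbed[0])                 # Top row: right half zeroed out.
```

Each perturbed image (segments present or masked) is then scored by the original model, exactly as perturbed token subsets are for text.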
Numerical Data
TBC.
In order to estimate the explanation model given a prediction for a target example, features transformed into an interpretable space are perturbed to generate similar examples around that target example. This is referred to as sampling for local exploration.
Given an example \(x\prime\) to be explained, its non-zero interpretable features (remember the space has a domain of \(\{0, 1\}\)) are uniformly sampled to generate a locally similar example \(z\prime\). So \(z\prime\) will always have a subset (or at most an equal set) of the non-zero features of \(x\prime\).
Take text data for illustration. If a particular example has the following tokens in the interpretable space:
A B C H I J
Then a possible local sample can be something like:
A C H J
(Imagine those letters are actual words present in the raw text of the example.)
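The subsampling scheme above can be sketched in a few lines (a toy illustration, not lime's internals):

```python
import numpy as np

rng = np.random.default_rng(64)

# Tokens present in the target example, in the interpretable space.
x_tokens = ["A", "B", "C", "H", "I", "J"]

def sample_local(tokens, rng):
    # Uniformly keep or drop each present token.
    keep = rng.integers(0, 2, size=len(tokens)).astype(bool)
    return [t for t, k in zip(tokens, keep) if k]

samples = [sample_local(x_tokens, rng) for _ in range(3)]
for s in samples:
    print(s)  # Each z' is a subset of the target's tokens.
```

No sample can ever contain a token absent from the target example, which is the root of the null-effect limitation discussed later.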
By default in the paper, 5,000 samples are generated for each single explanation. A hyperparameter \(K\) (default 10) caps how many non-zero interpretable features we’d like to estimate in the subsequent model learning phase, keeping the model fitting tractable and the resulting explanation manageable for human interpretation.
Now each perturbed example is first transformed back to the original feature space, then fed into the original model to get the predicted label. That is, from \(z\prime\) we need to get \(f(z)\), where \(f\) is the original model and \(z\) the original feature representation. These predictions serve exactly as the labels to train the local explanation model, with each random perturbation \(z\) weighted by \(\pi_x(z)\), a proximity function measuring how close \(z\) is to \(x\) in the original space.
In the original paper the proximity function is set to be an exponential kernel:
\[ \pi_x(z) = \exp \bigg( \frac{-D(x, z)^2}{\sigma^2} \bigg), \]
where \(D(x, z)\) is cosine distance for text and L2 distance for image, \(\sigma\) is a hyperparameter default at 25 in lime.
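Written out directly from the formula (with \(\sigma = 25\), the lime default):

```python
import numpy as np

def pi_x(d, sigma=25):
    """Proximity weight for a perturbed sample at distance d from the target x."""
    return np.exp(-(d ** 2) / sigma ** 2)

print(pi_x(0.0))             # An identical sample gets full weight 1.0.
print(round(pi_x(25.0), 4))  # One kernel width away: exp(-1), about 0.3679.
```

Samples far from the target thus contribute little to the weighted loss, which is what keeps the explanation local.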
The learner is a simple linear model:
\[ g(x\prime) = W \cdot x\prime. \]
We learn the explanation model weights \(W\) by minimizing the sum of proximity-weighted squared losses for all perturbed local samples:
\[ Loss = \sum_{z, z\prime} \pi_x(z) \cdot \big(f(z) - g(z\prime)\big)^2. \]
The actual learning algorithm proposed by LIME in the original paper is a LASSO. But in the actual implementation of the lime package, Ridge regression is used instead as the default learner. Despite this, the top \(K\) features for the learner are still chosen by a LASSO path.
Another discrepancy between the original paper and the actual implementation is the notion of proximity function \(\pi_x(z)\). In the actual implementation proximity is calculated in the interpretable space rather than in the original space. So essentially we should have denoted the function as \(\pi_{x\prime}(z\prime)\). Indeed, in the package source code the proximity function is defined as:
\[ \pi_{x\prime}(z\prime) = \sqrt{ \exp \bigg( \frac{-(D(x\prime, z\prime) \times 100)^2}{\sigma^2} \bigg)}. \]
As one may have realized by now, the original model can be a total black box: we only use its predictions as labels to learn the explanation model. Be aware that here \(f(z)\) returns the predicted probability as the label, so the explanation model is a regressor, not a classifier.
Linearity
Notice that the local explanation model is a linear model; the explanation is hence subject to linearity. If we have any evidence suggesting heavy non-linearity around a prediction, the output of such an explanation model won’t be faithful.
No Explanation for a NULL Effect
As for the local sampling, notice that we only subsample from the features present in the target example. So the explanation covers only the presence or absence of things actually present in the target example. A feature that is not present in the example can never be part of the explanation. This could potentially miss an important null effect of a feature.
Take the same example above:
A B C H I J
It could be the case that, instead of the presence of these 6 features, the missingness of feature Z is the most important driving force behind the model's prediction. However, due to the local sampling scheme, the importance of (the null of) Z can never be estimated by the explanation model.
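We can verify this limitation with a quick simulation, reusing the same token example:

```python
import numpy as np

rng = np.random.default_rng(0)

# "Z" is absent from the target example, so its indicator is 0 in x'.
vocab = ["A", "B", "C", "H", "I", "J", "Z"]
x_prime = np.array([1, 1, 1, 1, 1, 1, 0])

# Perturbations only flip bits that are 1 in x': Z can never switch on.
samples = x_prime * rng.integers(0, 2, size=(1000, len(vocab)))
print(samples[:, -1].sum())  # 0: no perturbed sample ever contains Z.
```

Since the design matrix column for Z is constant at zero, the explanation model can never assign it a meaningful weight.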
We use the Large Movie Review Dataset for a binary sentiment classification exercise. We will use machine learning libraries such as scikit-learn and tensorflow to quickly build a variety of (rather complicated and hard-to-interpret) models, and use lime to experiment with explanation modeling.
3.6.2 (v3.6.2:5fd33b5, Jul 8 2017, 04:57:36) [MSC v.1900 64 bit (AMD64)]
import os
import logging
logging.getLogger("tensorflow").setLevel(logging.ERROR)
import warnings
warnings.simplefilter(action="ignore", category=UserWarning)
warnings.simplefilter(action="ignore", category=FutureWarning)
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf
print(tf.__version__)
2.0.0
/device:GPU:0
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split
import joblib
print(sklearn.__version__)
0.22
# Create model dir to cache all models trained in the notebook.
model_dir = "models"
if not os.path.exists(model_dir):
os.makedirs(model_dir)
# Directory to cache dataset.
home = os.path.expanduser("~")
cache_dir = os.path.join(home, ".keras")
First, we prepare the movie review dataset.3
import tensorflow_datasets as tfds
# Load the data as tf.data.Dataset.
imdb = tfds.load(name="imdb_reviews", as_supervised=True,
data_dir=os.path.join(home, "tensorflow_datasets"))
The dataset is a perfectly balanced dataset with 50,000 examples, half for positive and half for negative sentiment.
# Extract all texts as list since we want to use libraries other than tensorflow as well.
# And since this is a small dataset, we don't care about memory usage.
# We skip the use of a dataset iterator.
imdb_reviews_train = []
imdb_reviews_test = []
imdb_y_train = []
imdb_y_test = []
for x, y in imdb["train"].batch(128):
imdb_reviews_train.extend(x.numpy())
imdb_y_train.extend(y.numpy())
for x, y in imdb["test"].batch(128):
imdb_reviews_test.extend(x.numpy())
imdb_y_test.extend(y.numpy())
# TF works on bytes, but some other packages may only work on decoded string.
imdb_reviews_train = [b.decode("utf8") for b in imdb_reviews_train]
imdb_reviews_test = [b.decode("utf8") for b in imdb_reviews_test]
imdb_y_train = np.array(imdb_y_train)
imdb_y_test = np.array(imdb_y_test)
# Take one review.
print(imdb_reviews_train[87])
Any movie that portrays the hard-working responsible husband as the person who has to change because of bored, cheating wife is an obvious result of 8 years of the Clinton era.<br /><br />It's little wonder that this movie was written by a woman.
0
We use the data prepared by tensorflow-datasets here just to save some time. For those who want to process the data in its original format (one review per .txt file), the files can be downloaded with this piece of code:
imdb_remote_path = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
imdb_fname = os.path.basename(imdb_remote_path)
imdb_local_path = os.path.join(cache_dir, "datasets", imdb_fname)
if not os.path.exists(imdb_local_path):
_ = tf.keras.utils.get_file(fname=imdb_fname, origin=imdb_remote_path,
extract=True, cache_dir=cache_dir)
Let’s build a random forest with TF-IDF as our feature space. We will use the popular scikit-learn library for implementation.4
# We drop words that are too frequent or too rare in the training dataset.
imdb_vectorizer = TfidfVectorizer(lowercase=True, min_df=10, max_df=.9)
imdb_X_train = imdb_vectorizer.fit_transform(imdb_reviews_train)
imdb_X_test = imdb_vectorizer.transform(imdb_reviews_test)
print(len(imdb_vectorizer.vocabulary_)) # Without OOV token.
18518
imdb_rf_model_file = os.path.join(model_dir, "text_rf.joblib")
# Save/reload the model to save notebook rendering time.
if os.path.exists(imdb_rf_model_file):
imdb_rf = joblib.load(imdb_rf_model_file)
else:
imdb_rf = RandomForestClassifier(n_estimators=300, random_state=64, n_jobs=-2)
_ = imdb_rf.fit(imdb_X_train, imdb_y_train)
_ = joblib.dump(imdb_rf, imdb_rf_model_file)
imdb_rf_pred = imdb_rf.predict(imdb_X_test)
imdb_rf_yhat = imdb_rf.predict_proba(imdb_X_test)[:,1]
print(classification_report(imdb_y_test, imdb_rf_pred))
              precision    recall  f1-score   support
0 0.84 0.86 0.85 12500
1 0.86 0.84 0.85 12500
accuracy 0.85 25000
macro avg 0.85 0.85 0.85 25000
weighted avg 0.85 0.85 0.85 25000
print(roc_auc_score(imdb_y_test, imdb_rf_yhat))
0.9274221727999999
As a baseline without extensive tuning (we didn’t tune anything indeed!), random forest seems to perform fairly well on this dataset.
As part of the algorithm’s design we are able to derive a global view of feature importance, based on how much each feature reduces impurity across all tree splits. For example, we can plot the top 20 features:
sorted_vocab = sorted(imdb_vectorizer.vocabulary_.items(), key=lambda kv: kv[1])
sorted_vocab = [w for w, i in sorted_vocab]
imdb_rf_feat_imp = pd.Series(imdb_rf.feature_importances_, index=sorted_vocab).sort_values()
ax = imdb_rf_feat_imp.tail(20).plot(kind="barh")
plt.show()
As one can see, common adjectives describing good or bad things generally have a larger impact in the model, which is totally expected. But we also see influential words such as just and minutes, which are quite neutral and carry no useful information on their own. They may be jointly important in the model, since a tree model allows interaction between variables. But we won’t be able to go deeper than the unconditional view we derived as a global feature ranking.
Interpretation of the impurity-based ranking must be done very carefully. For example, related features will theoretically have similar impact, but only one of them will gain a higher score (and suppress the other) in the ranking. Which one stands out is essentially random, due to the way tree splitting is performed during training.
In general it is NOT recommended to use impurity- or loss-based feature rankings to interpret a tree ensemble model. Such ranking information is still useful for understanding different aspects of the model, and can be used to subset features to counter over-fitting issues, if any. But it won’t really help explain the model at the prediction level: why is my model making such a prediction? And this is exactly why we need an explanation model in the first place.
Now let’s move on to model explanation with LIME. We pick one true positive and one false positive case made by our random forest model to see how the explanation model explains each case.
from lime.lime_text import LimeTextExplainer
# We need a pipeline since LimeTextExplainer.explain_instance expects raw text input.
imdb_rf_pipe = make_pipeline(imdb_vectorizer, imdb_rf)
imdb_rf_explainer = LimeTextExplainer(class_names=["Negative", "Positive"], random_state=64)
imdb_rf_tp_idx = np.where(np.logical_and(imdb_rf_pred == 1, imdb_y_test == 1))[0]
imdb_rf_fp_idx = np.where(np.logical_and(imdb_rf_pred == 1, imdb_y_test == 0))[0]
# We take one true positive and one false positive example to demo explanation.
imdb_rf_tp_exp = imdb_rf_explainer.explain_instance(
imdb_reviews_test[imdb_rf_tp_idx[0]], imdb_rf_pipe.predict_proba, num_features=6)
imdb_rf_fp_exp = imdb_rf_explainer.explain_instance(
imdb_reviews_test[imdb_rf_fp_idx[0]], imdb_rf_pipe.predict_proba, num_features=6)
# For ipynb, one can simply call imdb_tp_exp.show_in_notebook(text=True) to embed the html output.
imdb_rf_tp_exp.save_to_file("/tmp/explain_text_rf_tp.html")
imdb_rf_fp_exp.save_to_file("/tmp/explain_text_rf_fp.html")
Our RF model doesn’t seem to be very confident on this particular positive example. There is no single dominant word driving the prediction in the correct direction, and the contributing words are mostly neutral on their own. We can tell that this prediction is very sensitive and not robust. Admittedly, this review does show a mixture of positive and negative views.
Now let’s look at a false positive example, where our RF model wrongly labeled as a positive review.
In this example a single positive word, great, (wrongly) dominates the prediction toward a positive sentiment. And we realize the model didn’t respond well to some negative signals, especially the word bore.
If we examine more cases we may get more clues about how the model misbehaves, and we can come up with a strategy to improve it accordingly. For now we’ll stop here and experiment with other learning algorithms hereafter.
Let’s use the text data example above to build a LIME model from scratch to better understand every detail of the technique.
from scipy.sparse import csr_matrix
from sklearn.linear_model import Ridge, lars_path
from sklearn.metrics.pairwise import cosine_distances
N = 1000 # Number of local samples.
K = 10 # Number of features to use.
i = imdb_rf_tp_idx[0] # The target example.
x = imdb_X_test[i]
local_samples_x = csr_matrix(np.ones([N, 1])) * x # Container for perturbation.
present_tok_id = x.nonzero()[1] # Present features.
# Generate random design matrix for the explanation model.
# This is under the interpretable (binary) space.
local_samples_z = (np.random.uniform(size=(N, len(present_tok_id))) > .5).astype(int)
local_samples_z[0] = 1 # Keep the first sample identical to the target: all features present.
# Predict local samples by the original model.
remove_ind = pd.DataFrame(zip(*np.where(local_samples_z == 0)), columns=["rid", "pos"])
for r in range(1, N):
# We keep the first sample as the original target example.
df = remove_ind[remove_ind.rid == r]
if not df.empty:
tok_ids_to_remove = present_tok_id[df.pos]
local_samples_x[r,tok_ids_to_remove] = 0
local_samples_x.eliminate_zeros()
local_samples_y = imdb_rf.predict_proba(local_samples_x)[:,1]
# Calculate proximity weights under the interpretable space.
def pi_x(z):
kernel_width = 25
dist = cosine_distances(z[0].reshape(1, -1), z).ravel()
return np.sqrt(np.exp(-((dist * 100) ** 2) / kernel_width ** 2))
weights = pi_x(local_samples_z)
# Subset top K features with LAR path.
weighted_z = ((local_samples_z - np.average(local_samples_z, axis=0, weights=weights))
* np.sqrt(weights[:, np.newaxis]))
weighted_y = ((local_samples_y - np.average(local_samples_y, weights=weights))
* np.sqrt(weights))
_, _, coefs = lars_path(weighted_z, weighted_y, method="lasso")
nonzero_coefs = range(weighted_z.shape[1])
for i in range(len(coefs.T) - 1, 0, -1):
nonzero_coefs = coefs.T[i].nonzero()[0]
if len(nonzero_coefs) <= K:
break
# Learn the explanation model.
explainer = Ridge(alpha=1, fit_intercept=True, random_state=64)
_ = explainer.fit(local_samples_z[:,nonzero_coefs], local_samples_y, sample_weight=weights)
# Fitness.
# This can be a score to judge how good the local approximation is.
print(explainer.score(local_samples_z[:,nonzero_coefs], local_samples_y, sample_weight=weights))
0.8051909882251267
exp = pd.DataFrame({
"tok": np.array(sorted_vocab)[present_tok_id[nonzero_coefs]],
"imp": explainer.coef_
})
print(exp.sort_values("imp", ascending=False))
        tok       imp
5 love 0.044385
0 also 0.030580
4 job 0.020723
9 very 0.018468
1 been -0.020340
2 better -0.024194
3 even -0.042356
7 only -0.044749
8 thing -0.062265
6 nothing -0.079193
We used a smaller local sample size in our exercise, yet we can already reproduce quite closely the feature contribution scores given by lime’s API.
Now let’s try a shallow neural network model with word embeddings trained from scratch. We use the tensorflow.keras API to quickly build and train a neural net. We average the word embeddings into a document embedding for each review, then feed it through a ReLU layer before the sigmoid activation for cross-entropy optimization.
As an exercise, instead of re-using the vocabulary built by TfidfVectorizer with scikit-learn, we will re-tokenize the text data with the keras.preprocessing module. The inherent consistency under the Keras framework will also simplify our later work on network layering.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
# Build vocabulary. We use similar size as in our previous TfidfVectorizer.
# Since we will use zero padding, 0 cannot be used as OOV index.
# Keras tokenizer by default reserves 0 already. OOV token, if used, will be indexed at 1.
# Note that len(tokenizer.index_word) will be all vocabulary instead of `num_words`.
vocab_size = 20001 # +1 for 0 index used for padding.
oov_token = "<unk>"
tokenizer = Tokenizer(lower=True, oov_token=oov_token, num_words=vocab_size)
tokenizer.fit_on_texts(imdb_reviews_train)
# Encode text with padding to ensure fixed-length input.
seq_train = tokenizer.texts_to_sequences(imdb_reviews_train)
seq_train_padded = pad_sequences(seq_train, padding="post")
maxlen = seq_train_padded.shape[1]
seq_test = tokenizer.texts_to_sequences(imdb_reviews_test)
seq_test_padded = pad_sequences(seq_test, padding="post", maxlen=maxlen)
assert tokenizer.index_word[1] == oov_token
assert seq_train_padded.max() == vocab_size - 1
# Wrap Keras Sequential model with scikit-learn API.
# This is because LimeTextExplainer seems buggy with a native Keras model.
nn_model_file = os.path.join(model_dir, "text_clf_nn.h5")
def nn_model_fn():
embedding_size = 64
model = tf.keras.Sequential([
tf.keras.layers.Embedding(
vocab_size, embedding_size, input_length=maxlen,
mask_zero=True, name="word_embedding"),
tf.keras.layers.GlobalAveragePooling1D(name="doc_embedding"),
tf.keras.layers.Dense(embedding_size // 2, activation="relu", name="relu"),
tf.keras.layers.Dense(1, activation="sigmoid", name="sigmoid")
], name="nn_classifier")
model.compile(optimizer="adam",
loss="binary_crossentropy",
metrics=["accuracy"])
return model
print(nn_model_fn().summary(line_length=90))
Model: "nn_classifier"
__________________________________________________________________________________________
Layer (type) Output Shape Param #
==========================================================================================
word_embedding (Embedding) (None, 2493, 64) 1280064
__________________________________________________________________________________________
doc_embedding (GlobalAveragePooling1D) (None, 64) 0
__________________________________________________________________________________________
relu (Dense) (None, 32) 2080
__________________________________________________________________________________________
sigmoid (Dense) (None, 1) 33
==========================================================================================
Total params: 1,282,177
Trainable params: 1,282,177
Non-trainable params: 0
__________________________________________________________________________________________
None
imdb_nn = tf.keras.wrappers.scikit_learn.KerasClassifier(nn_model_fn)
if not os.path.exists(nn_model_file):
metrics = imdb_nn.fit(
x=seq_train_padded, y=imdb_y_train,
batch_size=256, epochs=10,
validation_data=(seq_test_padded, imdb_y_test),
validation_steps=20,
callbacks=[
tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=2),
tf.keras.callbacks.ModelCheckpoint(nn_model_file, monitor="val_loss", save_best_only=True)
],
verbose=2)
# Restore the model with wrapper.
imdb_nn.model = tf.keras.models.load_model(nn_model_file)
imdb_nn.classes_ = np.array([0, 1])
imdb_nn_yhat = imdb_nn.predict_proba(seq_test_padded)[:,1]
imdb_nn_pred = (imdb_nn_yhat > .5).astype(int)
print(classification_report(imdb_y_test, imdb_nn_pred))
              precision    recall  f1-score   support
0 0.89 0.89 0.89 12500
1 0.89 0.89 0.89 12500
accuracy 0.89 25000
macro avg 0.89 0.89 0.89 25000
weighted avg 0.89 0.89 0.89 25000
print(roc_auc_score(imdb_y_test, imdb_nn_yhat))
0.9535356512000001
Based on the testing AUC score, our shallow neural network model did outperform the random forest. Let’s see what the explanation model tells us about the behavior of the neural network model.
def nn_predict_fn(text):
# This is for sklearn wrapper only.
seq = tokenizer.texts_to_sequences(text)
seq = pad_sequences(seq, padding="post", maxlen=maxlen)
return imdb_nn.predict_proba(seq)
imdb_nn_explainer = LimeTextExplainer(class_names=["Negative", "Positive"])
# Explain the same examples as in RF.
imdb_nn_tp_exp = imdb_nn_explainer.explain_instance(
imdb_reviews_test[imdb_rf_tp_idx[0]], nn_predict_fn, num_features=6)
imdb_nn_fp_exp = imdb_nn_explainer.explain_instance(
imdb_reviews_test[imdb_rf_fp_idx[0]], nn_predict_fn, num_features=6)
imdb_nn_tp_exp.save_to_file("/tmp/explain_text_nn_tp.html")
imdb_nn_fp_exp.save_to_file("/tmp/explain_text_nn_fp.html")
The above is the LIME explanation of the same positive example previously explained with the RF model. We realize that, though both models eventually give a positive prediction, the neural network model has a very different opinion on how the positive prediction is formed. Instead of being confused and indecisive, the NN model is actually over-confident about this prediction! Some neutral words have a disproportionate contribution to the positive side, pointing out a potential direction for improving the model. For example, would a bigram tokenizer be better?
How about the second example (the negative review)? Our NN model also makes a mistake here, predicting it as a positive one.
What’s different here is the reaction to the negative word bore, which is not seen in RF.
Without an explanation model, it wouldn’t be easy for us to compare two models at this level of detail.
One step further: let’s use pre-trained word embeddings for the neural net and build another explanation model. We will use GloVe (Pennington, Socher, and Manning (2014)), and just the smaller GloVe model, since our dataset is quite small.
# Download GloVe pre-trained embeddings.
# The file is about 800MB so may take some time.
glove6b_remote_path = "http://nlp.stanford.edu/data/glove.6B.zip"
glove6b_local_path = os.path.join(cache_dir, "datasets", "glove.6B.50d.txt")
glove6b_fname = os.path.basename(glove6b_remote_path)
if not os.path.exists(glove6b_local_path):
_ = tf.keras.utils.get_file(fname=glove6b_fname, origin=glove6b_remote_path,
extract=True, cache_dir=cache_dir)
glove_all = pd.read_csv(glove6b_local_path, sep=" ", header=None, index_col=0, quoting=3)
In building the GloVe embeddings we need to take special care of the out-of-vocabulary token AND the padding index, since we will be using the Keras API.
# Map vocabulary to pre-trained embeddings.
matched_toks = []
for i, w in tokenizer.index_word.items():
if i < vocab_size:
if w in glove_all.index:
matched_toks.append(w)
else:
matched_toks.append(oov_token)
# Note that the GloVe pre-trained embeddings do not include their own OOV token.
# We will use the global average embedding to represent the OOV token.
print(len([t for t in matched_toks if t == oov_token])) # How many OOVs?
861
glove_all.loc[oov_token] = glove_all.values.mean(axis=0)
glove = glove_all.loc[matched_toks].values
# Append dummy 0-index vector to support padding.
glove = np.vstack([np.zeros((1, glove.shape[1])), glove])
print(glove.shape)
(20001, 50)
Now let’s build the neural network. Most of the code is the same as before, except that the Embedding layer is now initialized with a constant matrix. We keep the GloVe embeddings trainable so they can further adapt to our specific dataset.
tr_model_file = os.path.join(model_dir, "text_clf_tr.h5")
def tr_model_fn():
embedding_size = glove.shape[1]
model = tf.keras.Sequential([
tf.keras.layers.Embedding(
vocab_size, embedding_size, input_length=maxlen,
embeddings_initializer=tf.keras.initializers.Constant(glove),
trainable=True, mask_zero=True, name="glove_embedding"),
tf.keras.layers.GlobalAveragePooling1D(name="doc_embedding"),
tf.keras.layers.Dense(embedding_size // 2, activation="relu", name="relu"),
tf.keras.layers.Dense(1, activation="sigmoid", name="sigmoid")
], name="tr_classifier")
model.compile(optimizer="adam",
loss="binary_crossentropy",
metrics=["accuracy"])
return model
print(tr_model_fn().summary(line_length=90))
Model: "tr_classifier"
__________________________________________________________________________________________
Layer (type) Output Shape Param #
==========================================================================================
glove_embedding (Embedding) (None, 2493, 50) 1000050
__________________________________________________________________________________________
doc_embedding (GlobalAveragePooling1D) (None, 50) 0
__________________________________________________________________________________________
relu (Dense) (None, 25) 1275
__________________________________________________________________________________________
sigmoid (Dense) (None, 1) 26
==========================================================================================
Total params: 1,001,351
Trainable params: 1,001,351
Non-trainable params: 0
__________________________________________________________________________________________
None
imdb_tr = tf.keras.wrappers.scikit_learn.KerasClassifier(tr_model_fn)
if not os.path.exists(tr_model_file):
metrics = imdb_tr.fit(
x=seq_train_padded, y=imdb_y_train,
batch_size=256, epochs=20,
validation_data=(seq_test_padded, imdb_y_test),
validation_steps=20,
callbacks=[
tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=2),
tf.keras.callbacks.ModelCheckpoint(tr_model_file, monitor="val_loss", save_best_only=True)
],
verbose=2)
# Restore the model with wrapper.
imdb_tr.model = tf.keras.models.load_model(tr_model_file)
imdb_tr.classes_ = np.array([0, 1])
imdb_tr_yhat = imdb_tr.predict_proba(seq_test_padded)[:,1]
imdb_tr_pred = (imdb_tr_yhat > .5).astype(int)
print(classification_report(imdb_y_test, imdb_tr_pred))
              precision    recall  f1-score   support
0 0.88 0.89 0.89 12500
1 0.89 0.88 0.88 12500
accuracy 0.88 25000
macro avg 0.88 0.88 0.88 25000
weighted avg 0.88 0.88 0.88 25000
print(roc_auc_score(imdb_y_test, imdb_tr_yhat))
0.9521696352000001
Our NN model with transfer learning has an AUC score similar to the vanilla NN’s. Let’s use explanation modeling to see if there is any actual difference.
def tr_predict_fn(text):
# This is for sklearn wrapper only.
seq = tokenizer.texts_to_sequences(text)
seq = pad_sequences(seq, padding="post", maxlen=maxlen)
return imdb_tr.predict_proba(seq)
imdb_tr_explainer = LimeTextExplainer(class_names=["Negative", "Positive"])
# Explain the same examples as in RF.
imdb_tr_tp_exp = imdb_tr_explainer.explain_instance(
imdb_reviews_test[imdb_rf_tp_idx[0]], tr_predict_fn, num_features=6)
imdb_tr_fp_exp = imdb_tr_explainer.explain_instance(
imdb_reviews_test[imdb_rf_fp_idx[0]], tr_predict_fn, num_features=6)
imdb_tr_tp_exp.save_to_file("/tmp/explain_text_tr_tp.html")
imdb_tr_fp_exp.save_to_file("/tmp/explain_text_tr_fp.html")
For the same positive review, the model again shows over-confidence. Even the dominant words are the same.
For the negative review, interestingly, the transfer-learning NN indeed makes a correct prediction of the negative label. The word bore becomes the main driving force lowering the score.
As a final exercise on text classification, let’s experiment with explanation modeling on a recurrent neural network (RNN). RNNs are known to capture sequential dependencies better than the ngram bag-of-words approach.5
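To see what a bag-of-words representation misses, here is a tiny sanity check (using scikit-learn's CountVectorizer with its default unigram setting; the two example sentences are made up): two reviews with opposite readings but the same words map to identical vectors, while an RNN reading tokens in order can in principle distinguish them.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Unigram bag-of-words: word order is discarded entirely.
v = CountVectorizer()
X = v.fit_transform([
    "not good but bad",   # negative reading
    "not bad but good",   # positive reading
]).toarray()
print((X[0] == X[1]).all())  # True: identical vectors despite opposite meaning
```

Higher-order ngrams mitigate this, at the cost of a much larger vocabulary.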
rnn_model_file = os.path.join(model_dir, "text_clf_rnn.h5")
def rnn_model_fn():
embedding_size = glove.shape[1]
model = tf.keras.Sequential([
tf.keras.layers.Embedding(
vocab_size, embedding_size, input_length=maxlen,
embeddings_initializer=tf.keras.initializers.Constant(glove),
trainable=True, mask_zero=True, name="glove_embedding"),
tf.keras.layers.GRU(64, dropout=.2, name="GRU"),
tf.keras.layers.Dense(1, activation="sigmoid", name="sigmoid")
], name="rnn_classifier")
model.compile(optimizer="adam",
loss="binary_crossentropy",
metrics=["accuracy"])
return model
print(rnn_model_fn().summary(line_length=90))
Model: "rnn_classifier"
__________________________________________________________________________________________
Layer (type) Output Shape Param #
==========================================================================================
glove_embedding (Embedding) (None, 2493, 50) 1000050
__________________________________________________________________________________________
GRU (GRU) (None, 64) 22272
__________________________________________________________________________________________
sigmoid (Dense) (None, 1) 65
==========================================================================================
Total params: 1,022,387
Trainable params: 1,022,387
Non-trainable params: 0
__________________________________________________________________________________________
None
imdb_rnn = tf.keras.wrappers.scikit_learn.KerasClassifier(rnn_model_fn)
if not os.path.exists(rnn_model_file):
metrics = imdb_rnn.fit(
x=seq_train_padded, y=imdb_y_train,
batch_size=32, epochs=10,
validation_data=(seq_test_padded, imdb_y_test),
validation_steps=20,
callbacks=[
tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=2),
tf.keras.callbacks.ModelCheckpoint(rnn_model_file, monitor="val_loss", save_best_only=True)
],
verbose=2)
# Restore the model with wrapper.
Train on 25000 samples, validate on 25000 samples
Epoch 1/10
25000/25000 - 53s - loss: 0.4785 - accuracy: 0.7562 - val_loss: 0.0098 - val_accuracy: 0.8547
Epoch 2/10
25000/25000 - 47s - loss: 0.2615 - accuracy: 0.8956 - val_loss: 0.0076 - val_accuracy: 0.8953
Epoch 3/10
25000/25000 - 47s - loss: 0.1908 - accuracy: 0.9272 - val_loss: 0.0062 - val_accuracy: 0.8984
Epoch 4/10
25000/25000 - 46s - loss: 0.1470 - accuracy: 0.9442 - val_loss: 0.0066 - val_accuracy: 0.9094
Epoch 5/10
25000/25000 - 46s - loss: 0.1129 - accuracy: 0.9596 - val_loss: 0.0065 - val_accuracy: 0.9078
imdb_rnn.model = tf.keras.models.load_model(rnn_model_file)
imdb_rnn.classes_ = np.array([0, 1])
imdb_rnn_yhat = imdb_rnn.predict_proba(seq_test_padded)[:,1] # Inference with the RNN takes time.
imdb_rnn_pred = (imdb_rnn_yhat > .5).astype(int)
print(classification_report(imdb_y_test, imdb_rnn_pred)) precision recall f1-score support
0 0.86 0.94 0.90 12500
1 0.93 0.85 0.89 12500
accuracy 0.89 25000
macro avg 0.90 0.89 0.89 25000
weighted avg 0.90 0.89 0.89 25000
0.9638315199999999
RNN with pre-trained GloVe embeddings seems to work very well, even for such a small dataset. Let’s see how the explanation can differ, again, for the same two examples:
def rnn_predict_fn(text):
# This is for sklearn wrapper only.
seq = tokenizer.texts_to_sequences(text)
seq = pad_sequences(seq, padding="post", maxlen=maxlen)
return imdb_rnn.predict_proba(seq)
imdb_rnn_explainer = LimeTextExplainer(class_names=["Negative", "Positive"])
# Explain the same examples as in RF.
imdb_rnn_tp_exp = imdb_rnn_explainer.explain_instance(
imdb_reviews_test[imdb_rf_tp_idx[0]], rnn_predict_fn, num_features=6)
imdb_rnn_fp_exp = imdb_rnn_explainer.explain_instance(
imdb_reviews_test[imdb_rf_fp_idx[0]], rnn_predict_fn, num_features=6)
imdb_rnn_tp_exp.save_to_file("/tmp/explain_text_rnn_tp.html")
imdb_rnn_fp_exp.save_to_file("/tmp/explain_text_rnn_fp.html")
We see the same over-confidence from all the NN models on this particular positive review.
For the negative review, the RNN also correctly predicts the label. This may relate to the fact that both models use the pre-trained embeddings.
Lots of data can be represented in tabular format. Here we will use UCI Heart Disease dataset for demo. Particularly, we use the Cleveland dataset which is commonly used in machine learning research.6
ucihd_remote_path = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
ucihd_fname = os.path.basename(ucihd_remote_path)
ucihd_local_path = os.path.join(cache_dir, "datasets", ucihd_fname)
if not os.path.exists(ucihd_local_path):
_ = tf.keras.utils.get_file(fname=ucihd_fname, origin=ucihd_remote_path,
extract=False, cache_dir=cache_dir)
The dataset contains both numerical and categorical features (all encoded as numerics already; please refer to the in-line comments for documentation). It is tiny in both number of features and number of examples, but as a demo case it serves the purpose well.
ucihd_attr = [
"age",
"sex", # 0 = female 1 = male
"cp", # chest pain type 1: typical angina 2: atypical angina 3: non-anginal pain 4: asymptomatic
"trestbps", # resting blood pressure (in mm Hg on admission to the hospital)
"chol", # serum cholestoral in mg/dl
"fbs", # (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
"restecg", # resting electrocardiographic results 0: normal 1: having ST-T wave abnormality 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
"thalach", # maximum heart rate achieved
"exang", # exercise induced angina (1 = yes; 0 = no)
"oldpeak", # ST depression induced by exercise relative to rest
"slope", # the slope of the peak exercise ST segment
"ca", # number of major vessels (0-3) colored by flouroscopy
"thal", # 3 = normal; 6 = fixed defect; 7 = reversable defect
"label" # diagnosis of heart disease (angiographic disease status) 0: < 50% diameter narrowing 1-4: > 50% diameter narrowing
]
ucihd = pd.read_csv(ucihd_local_path, header=None, names=ucihd_attr, na_values="?")
categorical_attr = ["sex", "cp", "fbs", "restecg", "exang", "thal"]
for col in categorical_attr:
ucihd[col] = ucihd[col].astype("category")
# Clean label.
ucihd.loc[ucihd["label"] > 1, "label"] = 1
print(ucihd.shape)
(303, 14)
label
0 164
1 139
dtype: int64
age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal label
0 63.0 1.0 1.0 145.0 233.0 1.0 2.0 150.0 0.0 2.3 3.0 0.0 6.0 0
1 67.0 1.0 4.0 160.0 286.0 0.0 2.0 108.0 1.0 1.5 2.0 3.0 3.0 1
2 67.0 1.0 4.0 120.0 229.0 0.0 2.0 129.0 1.0 2.6 2.0 2.0 7.0 1
3 37.0 1.0 3.0 130.0 250.0 0.0 0.0 187.0 0.0 3.5 3.0 0.0 3.0 0
4 41.0 0.0 2.0 130.0 204.0 0.0 2.0 172.0 0.0 1.4 1.0 0.0 3.0 0
Again we try to explain tree ensembles.
# sklearn's implementation of RF doesn't allow missing value.
# For categorical (as string) we can leave one special category for missing,
# but for numerical we need to do some special encoding or imputation.
ucihd_2 = ucihd.copy()
ucihd_2.loc[ucihd_2["ca"].isna(), "ca"] = -1 # Encode missing numerical.
# One-hot encode all categorical features.
ucihd_2 = pd.get_dummies(ucihd_2, columns=categorical_attr, dummy_na=True)
ucihd_y = ucihd_2.pop("label")
ucihd_X_train, ucihd_X_test, ucihd_y_train, ucihd_y_test = train_test_split(
ucihd_2, ucihd_y.values, test_size=.3, random_state=64)
ucihd_rf = RandomForestClassifier(n_estimators=100, random_state=64)
_ = ucihd_rf.fit(ucihd_X_train, ucihd_y_train)
ucihd_rf_yhat = ucihd_rf.predict_proba(ucihd_X_test)[:,1]
ucihd_rf_pred = ucihd_rf.predict(ucihd_X_test)
print(classification_report(ucihd_y_test, ucihd_rf_pred)) precision recall f1-score support
0 0.82 0.84 0.83 50
1 0.80 0.78 0.79 41
accuracy 0.81 91
macro avg 0.81 0.81 0.81 91
weighted avg 0.81 0.81 0.81 91
0.9004878048780488
As one can see RF performs very well on this dataset.
To explain a model trained with numerical features, lime by default discretizes continuous variables into quantiles for ease of interpretation. The discretization uses statistics derived from the training dataset.
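A rough sketch of the idea (the helper names below are made up for illustration and are not lime's actual internals): the training column is cut at its quartiles, and the explanation then refers to the bin a value falls into rather than to the raw number.

```python
import numpy as np

def quartile_bins(train_col):
    """Quartile cut points derived from the training data."""
    return np.percentile(train_col, [25, 50, 75])

def describe(name, value, bins):
    """Render a value as the human-readable bin it falls into."""
    q25, q50, q75 = bins
    if value <= q25:
        return f"{name} <= {q25:.2f}"
    elif value <= q50:
        return f"{q25:.2f} < {name} <= {q50:.2f}"
    elif value <= q75:
        return f"{q50:.2f} < {name} <= {q75:.2f}"
    return f"{name} > {q75:.2f}"

# Toy cholesterol values standing in for a training column.
chol_train = np.array([180., 200., 233., 250., 286., 300.])
print(describe("chol", 240., quartile_bins(chol_train)))
# -> "208.25 < chol <= 241.50"
```

The resulting explanation talks about interpretable ranges such as "208.25 < chol <= 241.50" instead of opaque raw feature values.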
from lime.lime_tabular import LimeTabularExplainer
cat_ind = [i for i, col in enumerate(ucihd_2.columns) if "_" in col]
ucihd_rf_explainer = LimeTabularExplainer(
ucihd_X_train.values, class_names=["Negative", "Positive"],
feature_names=ucihd_2.columns,
categorical_features=cat_ind)
ucihd_rf_tp_idx = np.where(np.logical_and(ucihd_rf_pred == 1, ucihd_y_test == 1))[0]
ucihd_rf_fp_idx = np.where(np.logical_and(ucihd_rf_pred == 1, ucihd_y_test == 0))[0]
# We take one true positive and one false positive for examples.
ucihd_rf_tp_exp = ucihd_rf_explainer.explain_instance(
ucihd_X_test.iloc[ucihd_rf_tp_idx[0]], ucihd_rf.predict_proba, num_features=4)
ucihd_rf_fp_exp = ucihd_rf_explainer.explain_instance(
ucihd_X_test.iloc[ucihd_rf_fp_idx[0]], ucihd_rf.predict_proba, num_features=4)
ucihd_rf_tp_exp.save_to_file("/tmp/explain_tab_rf_tp.html")
ucihd_rf_fp_exp.save_to_file("/tmp/explain_tab_rf_fp.html")
Following the same idea as in our discussion of text classifiers, we choose two examples from the RF predictions, one true positive and one false positive, to demonstrate explanation modeling.
The explanation suggests several dominant features driving toward the positive label. For categoricals, each category serves as an individual contribution in the explanation. This is a natural consequence of one-hot encoding in our feature space.
For the false positive case, the model is less confident. There are indeed more features driving negatively, but one strong positive contribution from the feature ca (number of major vessels colored by fluoroscopy) cancels out the entire negative driving force.
Gradient boosting trees (GBT) are a powerful model family proven to work exceptionally well in many different applications. Yet due to their ensembling nature, GBTs are also hard to interpret in general.
Here we demo lightgbm’s implementation of GBT with LIME explanation.
import lightgbm as lgb
ucihd_tr = lgb.Dataset(ucihd_X_train, label=ucihd_y_train)
ucihd_te = lgb.Dataset(ucihd_X_test, label=ucihd_y_test)
ucihd_lgb_params = {
"learning_rate": .01,
"boosting_type": "gbdt",
"objective": "binary",
"metric": ["binary_logloss", "auc"],
"num_leaves": 8,
"max_depth": 3,
"min_data_per_leaf": 5,
"verbose": -1,
"seed": 64
}
ucihd_bst = lgb.train(
params=ucihd_lgb_params,
num_boost_round=300, early_stopping_rounds=20,
train_set=ucihd_tr, valid_sets=[ucihd_te],
verbose_eval=10)
Training until validation scores don't improve for 20 rounds
[10] valid_0's binary_logloss: 0.655886 valid_0's auc: 0.775854
[20] valid_0's binary_logloss: 0.628993 valid_0's auc: 0.785854
[30] valid_0's binary_logloss: 0.60576 valid_0's auc: 0.81122
[40] valid_0's binary_logloss: 0.585625 valid_0's auc: 0.824146
[50] valid_0's binary_logloss: 0.569745 valid_0's auc: 0.826585
[60] valid_0's binary_logloss: 0.556003 valid_0's auc: 0.826098
[70] valid_0's binary_logloss: 0.542218 valid_0's auc: 0.839024
[80] valid_0's binary_logloss: 0.532141 valid_0's auc: 0.843902
[90] valid_0's binary_logloss: 0.524471 valid_0's auc: 0.84439
[100] valid_0's binary_logloss: 0.516474 valid_0's auc: 0.85122
[110] valid_0's binary_logloss: 0.510493 valid_0's auc: 0.85122
[120] valid_0's binary_logloss: 0.506372 valid_0's auc: 0.852683
[130] valid_0's binary_logloss: 0.498944 valid_0's auc: 0.851951
[140] valid_0's binary_logloss: 0.493064 valid_0's auc: 0.854878
[150] valid_0's binary_logloss: 0.49013 valid_0's auc: 0.856341
[160] valid_0's binary_logloss: 0.487886 valid_0's auc: 0.853415
Early stopping, best iteration is:
[145] valid_0's binary_logloss: 0.491347 valid_0's auc: 0.856341
ucihd_lgb_yhat = ucihd_bst.predict(ucihd_X_test)
ucihd_lgb_pred = (ucihd_lgb_yhat > .5).astype(int)
print(classification_report(ucihd_y_test, ucihd_lgb_pred)) precision recall f1-score support
0 0.80 0.78 0.79 50
1 0.74 0.76 0.75 41
accuracy 0.77 91
macro avg 0.77 0.77 0.77 91
weighted avg 0.77 0.77 0.77 91
0.8563414634146341
In this particular (rather small) dataset RF indeed outperforms GBT. As a matter of fact, based on existing benchmarks a simple logistic regression may achieve an even higher score for this problem. Nevertheless, let’s move on to our explanation model with LIME:
def ucihd_lgb_predict_fn(x):
# We need to output 2 columns for binary prob prediction.
p = ucihd_bst.predict(x).reshape(-1, 1)
return np.hstack((1 - p, p))
ucihd_lgb_explainer = LimeTabularExplainer(
ucihd_X_train.values, class_names=["Negative", "Positive"],
feature_names=ucihd_2.columns,
categorical_features=cat_ind)
# We take the same examples previously explained in our RF explanation model.
ucihd_lgb_tp_exp = ucihd_lgb_explainer.explain_instance(
ucihd_X_test.iloc[ucihd_rf_tp_idx[0]], ucihd_lgb_predict_fn, num_features=4)
ucihd_lgb_fp_exp = ucihd_lgb_explainer.explain_instance(
ucihd_X_test.iloc[ucihd_rf_fp_idx[0]], ucihd_lgb_predict_fn, num_features=4)
ucihd_lgb_tp_exp.save_to_file("/tmp/explain_tab_lgb_tp.html")
ucihd_lgb_fp_exp.save_to_file("/tmp/explain_tab_lgb_fp.html")
The behavior of GBT looks similar to that of RF on these two examples.
In both cases, the variable ca has a dominant impact on the final decision. The two models also share the same confusion on the negative example.
This section is a digression on lightgbm usage.
Since lime’s API requires us to prepare our dataset in a one-hot encoded representation, our lightgbm code uses the same data pipeline as the scikit-learn random forest. But that is actually not optimal for lightgbm. The following code chunk showcases the best practice of encoding categoricals in lightgbm: we don’t encode them at all!
# We leave both missings and categoricals as-is in the dataset.
ucihd_train, ucihd_test = train_test_split(ucihd, test_size=.3, random_state=64)
ucihd_tr = lgb.Dataset(
ucihd_train.drop("label", axis=1), label=ucihd_train["label"],
categorical_feature=categorical_attr,
free_raw_data=False)
ucihd_te = lgb.Dataset(
ucihd_test.drop("label", axis=1), label=ucihd_test["label"],
categorical_feature=categorical_attr,
free_raw_data=False)
ucihd_bst_2 = lgb.train(
params=ucihd_lgb_params,
num_boost_round=300, early_stopping_rounds=20,
train_set=ucihd_tr, valid_sets=[ucihd_te],
verbose_eval=-1)
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[94] valid_0's binary_logloss: 0.519726 valid_0's auc: 0.852683
ucihd_lgb_yhat = ucihd_bst_2.predict(ucihd_test.drop("label", axis=1))
ucihd_lgb_pred = (ucihd_lgb_yhat > .5).astype(int)
print(roc_auc_score(ucihd_test["label"], ucihd_lgb_yhat))
0.8526829268292683
To summarize, there are two very special properties of the lightgbm algorithm. First, lightgbm treats missings natively as a special tree split point. This allows us to keep the original missings as-is, and in many cases it can result in better accuracy than imputation.7
In addition, lightgbm encodes categorical variables internally in a more efficient way, so we don’t even need to do one-hot encoding on our own. Of course in this tiny dataset we won’t see any noticeable difference, but for large applications the performance impact can be huge. Moreover, by skipping the one-hot encoding pipeline our code becomes much neater as well.
TODO: Use a pre-trained model?
TODO: Theory Briefing here.
Lundberg and Lee (2017) propose SHAP (SHapley Additive exPlanations), yet another additive feature attribution method for model explainability. It is a more general approach, of which LIME is indeed a special case. Just like LIME, in theory it can be applied to any machine learning model, but it comes with a customized fast implementation particularly for gradient boosting trees (GBT). It supports the APIs of well-known GBT libraries such as xgboost, lightgbm, and catboost.
The interpretability provided by SHAP is again local. It assigns each feature an importance value for a particular prediction. Hence for any given model prediction it tells us what may be the driving force behind that prediction.
shap also comes with more visualization methods for feature investigation, especially for feature interaction exploration.
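To ground the idea behind SHAP, here is a brute-force sketch of exact Shapley values (an illustrative toy, nothing to do with shap's optimized implementation): the value v(S) of a coalition S is the model output averaged over background rows with the features in S fixed to the instance being explained, and each feature's attribution is the classic weighted average of its marginal contributions over all coalitions.

```python
import itertools
import math
import numpy as np

def shapley_values(f, x, background):
    """Exact Shapley values by subset enumeration (exponential cost)."""
    n = len(x)
    def v(S):
        # Fix features in S to the explained instance, average over background.
        X = np.array(background, dtype=float)
        X[:, list(S)] = x[list(S)]
        return f(X).mean()
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for r in range(n):
            for S in itertools.combinations(others, r):
                w = math.factorial(r) * math.factorial(n - r - 1) / math.factorial(n)
                phi[i] += w * (v(S + (i,)) - v(S))
    return phi

# Toy linear model: Shapley values recover the per-feature contributions.
f = lambda X: X @ np.array([2.0, -1.0, 0.5])
background = np.zeros((1, 3))
x = np.array([1.0, 1.0, 1.0])
phi = shapley_values(f, x, background)
print(phi)  # phi matches the linear weights: [2.0, -1.0, 0.5]
```

The additivity property holds by construction: the attributions sum to the difference between the prediction for x and the expected prediction over the background, which is exactly the relation we will verify later with shap's own values.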
TODO: Theory Briefing here.
shap.TreeExplainer is optimized for GBT but not RF. For a model with high dimensionality, like a bag-of-words model, a non-GBT model will suffer from high computation cost. Hence we skip the discussion on RF and move forward to a GBT implementation.
In the previous section we didn’t train a GBT for the text classification problem. So let’s quickly build one such model first (with the same TF-IDF vectorization as we did for the RF model).
# lightgbm does not allow utf-8 encoded feature names.
# Since important tokens are most likely ascii-compatible for our dataset,
# we simply strip non-ascii as a workaround for this exercise.
def remove_non_ascii(s):
return "".join([i if ord(i) < 128 else "_" for i in s])
sorted_vocab_ascii = [remove_non_ascii(v) for v in sorted_vocab]
imdb_X_tr = lgb.Dataset(imdb_X_train, label=imdb_y_train, feature_name=sorted_vocab_ascii)
imdb_X_te = lgb.Dataset(imdb_X_test, label=imdb_y_test, feature_name=sorted_vocab_ascii)
imdb_lgb_params = {
"learning_rate": .05,
"boosting_type": "gbdt",
"objective": "binary",
"metric": ["binary_logloss", "auc"],
"num_leaves": 16,
"max_depth": 4,
"min_data_per_leaf": 20,
"verbose": -1
}
imdb_lgb_model_file = os.path.join(model_dir, "text_clf_lgb.txt")
# Save/reload model to save notebook rendering time.
if os.path.exists(imdb_lgb_model_file):
# Parameters are not loaded back? (Which cause the subsequent call to shap_values fail.)
# https://github.com/microsoft/LightGBM/issues/2613
# As a workaround we pass the same parameters to re-construct the model.
imdb_bst = lgb.Booster(model_file=imdb_lgb_model_file, params=imdb_lgb_params)
else:
imdb_bst = lgb.train(
params=imdb_lgb_params,
num_boost_round=1000, early_stopping_rounds=20,
train_set=imdb_X_tr, valid_sets=[imdb_X_te],
verbose_eval=100)
_ = imdb_bst.save_model(imdb_lgb_model_file)
Training until validation scores don't improve for 20 rounds
[100] valid_0's binary_logloss: 0.479194 valid_0's auc: 0.88184
[200] valid_0's binary_logloss: 0.424974 valid_0's auc: 0.906732
[300] valid_0's binary_logloss: 0.394577 valid_0's auc: 0.918715
[400] valid_0's binary_logloss: 0.374584 valid_0's auc: 0.925959
[500] valid_0's binary_logloss: 0.359882 valid_0's auc: 0.930936
[600] valid_0's binary_logloss: 0.348809 valid_0's auc: 0.934481
[700] valid_0's binary_logloss: 0.340231 valid_0's auc: 0.937016
[800] valid_0's binary_logloss: 0.333412 valid_0's auc: 0.938953
[900] valid_0's binary_logloss: 0.327711 valid_0's auc: 0.94053
[1000] valid_0's binary_logloss: 0.323358 valid_0's auc: 0.94162
Did not meet early stopping. Best iteration is:
[1000] valid_0's binary_logloss: 0.323358 valid_0's auc: 0.94162
imdb_lgb_yhat = imdb_bst.predict(imdb_X_test)
imdb_lgb_pred = (imdb_lgb_yhat > .5).astype(int)
print(classification_report(imdb_y_test, imdb_lgb_pred)) precision recall f1-score support
0 0.88 0.85 0.86 12500
1 0.85 0.88 0.87 12500
accuracy 0.86 25000
macro avg 0.86 0.86 0.86 25000
weighted avg 0.86 0.86 0.86 25000
0.9416203136
Based on the testing AUC score we find that GBT performs comparably to the neural network models.
Just like with RF, we have access to the overall feature importance of the GBT model:8
The global feature ranking reveals some highly ranked features to be meaningless on their own, especially the word it. But as discussed earlier we shouldn’t over-interpret the ranks without proper explanation modeling.
Since shap.TreeExplainer is customized for GBT for speed, we can feed in all testing examples to calculate all shap values at once.
import shap
# Sparse matrix is supported by shap for lightgbm models.
imdb_lgb_explainer = shap.TreeExplainer(imdb_bst)
imdb_lgb_shap_values = imdb_lgb_explainer.shap_values(imdb_X_test)
def imdb_lgb_shap_plot(test_id, matplotlib=True):
shap_plt = shap.force_plot(
imdb_lgb_explainer.expected_value[1],
imdb_lgb_shap_values[1][test_id,:],
imdb_X_test[test_id,:].toarray(), # We still need a dense matrix here.
feature_names=sorted_vocab,
matplotlib=matplotlib
)
return shap_plt
One advantage of shap on GBT models is the capability of traversing all the testing examples, thanks to its efficiency. So we can use the resulting shap values to derive a global feature importance judged by the average shap value (contribution) of each feature. Note that this is different from the loss/impurity- or split-count-based feature ranking derived from RF/GBT during training. It is an aggregation of all local prediction explanations (contributions) during testing data inference.
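The aggregation itself is simple; a toy sketch (hypothetical numbers, not our IMDB shap values) of how a mean-|shap| global ranking is formed from per-example attributions:

```python
import numpy as np

# Rows = explained examples, columns = features.
shap_vals = np.array([
    [ 0.50, -0.10, 0.02],
    [-0.30,  0.20, 0.01],
    [ 0.40,  0.00, 0.03],
])
mean_abs = np.abs(shap_vals).mean(axis=0)   # per-feature mean |contribution|
ranking = np.argsort(-mean_abs)             # most impactful feature first
print(mean_abs)  # [0.4  0.1  0.02]
print(ranking)   # [0 1 2]
```

Taking absolute values matters: a feature that pushes strongly in both directions across examples would otherwise average toward zero and look unimportant.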
shap.summary_plot(imdb_lgb_shap_values, imdb_X_test, feature_names=sorted_vocab,
plot_type="bar", max_display=20, show=False, plot_size=.25)
plt.show()
As we can see, the ranking based on shap values for the testing set will in general differ from the ranking derived from the training split. And it is more interpretable: features with a higher rank literally have a higher average impact on the testing dataset. The ranking can also be conditioned on labels.
The most important application of shap still lies in instance-level explanation. We stick to the previous two reviews. For the review that RF correctly labels positive, we have the shap explanation with the following visualization:
Note that by default shap for lightgbm shows log-odds rather than probability in the plot. So a positive value indicates a positive prediction, and a negative value a negative one.
To verify this:
def to_log_odds(p):
return np.log(p / (1 - p))
def to_p(log_odds):
return np.exp(log_odds)/(1 + np.exp(log_odds))
# Take the first true positive to examine:
p = imdb_bst.predict(imdb_X_test[imdb_rf_tp_idx[0],:].toarray())
print(p)
[0.71683258]
[0.92880398]
For any given prediction, the shap values of all features should sum up to the difference between the predicted log-odds and the expected log-odds. To verify this on the specific positive example:
expected_log_odds = imdb_lgb_explainer.expected_value[1]
predicted_log_odds = to_log_odds(p)
print(predicted_log_odds - expected_log_odds) # The difference.
[1.09763611]
shap_v = pd.DataFrame({
"token": sorted_vocab,
"shap_value": imdb_lgb_shap_values[1][imdb_rf_tp_idx[0],:],
"tfidf": np.squeeze(imdb_X_test[imdb_rf_tp_idx[0]].toarray())
})
shap_v = shap_v.sort_values("shap_value", ascending=False)
print(shap_v) # Shap values of all features for the example.
            token  shap_value     tfidf
8464 incredible 0.469118 0.138549
9893 love 0.358313 0.076663
4420 definitely 0.267002 0.109114
1420 bad 0.200221 0.000000
728 also 0.188378 0.066295
... ... ... ...
8908 it -0.135065 0.031411
1773 better -0.136536 0.075703
7329 great -0.173293 0.000000
14932 silly -0.259718 0.125069
11305 nothing -0.725885 0.084064
[18518 rows x 3 columns]
1.0976361111226303
From the entire set of shap values we can learn, for example, that the absence of great contributes negatively, and the presence of love contributes positively, to the final prediction.
For the false positive case, similar to RF, the word great plays a big role in pushing the GBT prediction toward positive.
As of 2019-12-12 shap.DeepExplainer does not yet support TF 2.0.9 And shap.GradientExplainer is not yet well documented for TF 2.0. So we will use shap.KernelExplainer, an implementation-agnostic explainer in shap. The compromise is that it runs very slowly for each prediction.
imdb_exp_ind = np.array([imdb_rf_tp_idx[0], imdb_rf_fp_idx[0]])
# KernelExplainer.
def mm(X):
return imdb_tr.predict_proba(X)[:,1]
imdb_nn_shap_explainer = shap.KernelExplainer(mm, seq_train_padded[:100])
# This is VERY slow...
#imdb_nn_kernel_shap_values = imdb_nn_shap_explainer.shap_values(seq_test_padded[imdb_exp_ind])
# TODO:
# Contribution is attributed to original sequence input.
# In order to make explanation readable,
# we need to map each position to original word id then to word.
# TODO: Make sure everything works here.
# shap does not support keras model in scikit-learn wrapper.
# Let's re-build the model and retain its Sequental class.
dl_model = model_fn()
metrics = dl_model.fit(
x=seq_train_padded, y=imdb_y_train,
batch_size=256, epochs=20,
validation_data=(seq_test_padded, imdb_y_test),
validation_steps=20,
callbacks=[
tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=2),
tf.keras.callbacks.ModelCheckpoint(tr_model_file, monitor="val_loss", save_best_only=True)
],
verbose=0)
# DeepExplainer.
dl_shap_explainer = shap.DeepExplainer(dl_model, seq_train_padded) # Won't work.
# GradientExplainer.
imdb_nn_shap_explainer = shap.GradientExplainer(dl_model, seq_train_padded[:100])
imdb_nn_shap_explainer = shap.GradientExplainer(
(imdb_tr.layers[0].input, imdb_tr.layers[-1].output), # Not working for TF 2.0.
seq_train_padded[:100])
imdb_nn_shap_explainer.shap_values(seq_test_padded[:3]) # Error here.
We do the same exercise on the tabular dataset previously explained by lime.
ucihd_rf_explainer = shap.TreeExplainer(ucihd_rf)
ucihd_rf_shap_values = ucihd_rf_explainer.shap_values(ucihd_X_test)
def ucihd_rf_shap_plot(test_id, matplotlib=True):
shap_plt = shap.force_plot(
ucihd_rf_explainer.expected_value[1],
ucihd_rf_shap_values[1][test_id,:],
ucihd_X_test.iloc[[test_id]],
matplotlib=matplotlib
)
return shap_plt
From the global ranking we can confirm that the variable ca is definitely an influential feature.
shap.summary_plot(ucihd_rf_shap_values, ucihd_X_test,
plot_type="bar", max_display=10, show=False, plot_size=.25)
plt.gcf().subplots_adjust(bottom=.25, left=.25)
plt.show()
ucihd_rf_feat_imp = pd.Series(ucihd_rf.feature_importances_, index=ucihd_X_train.columns).sort_values()
ax = ucihd_rf_feat_imp.tail(10).plot(kind="barh")
plt.show()
We can plot partial dependence based on the shap values of two features over the entire testing dataset. For example, knowing that ca is important, we’d like to further know how age impacts the contribution of ca across different examples.
shap.dependence_plot("age", ucihd_rf_shap_values[1], ucihd_X_test, interaction_index="ca", show=False)
plt.gcf().subplots_adjust(left=.25)
plt.show()
The result suggests, among other things, that ca has less impact for younger people. Such findings can be examined by domain experts to see whether the model is learning the correct pattern that we expect, or at least one that we can reason about.
Note that for scikit-learn RF models, shap by default reports probability instead of log-odds. This behavioral difference results from the optimization customized for the GBT model family.
For GBT we feed the optimized model, where categoricals are encoded internally without explicit one-hot encoding.
ucihd_lgb_explainer = shap.TreeExplainer(ucihd_bst_2)
ucihd_lgb_shap_values = ucihd_lgb_explainer.shap_values(ucihd_test.drop("label", axis=1))
def ucihd_lgb_shap_plot(test_id, matplotlib=True):
shap_plt = shap.force_plot(
ucihd_lgb_explainer.expected_value[1],
ucihd_lgb_shap_values[1][test_id,:],
ucihd_test.iloc[[test_id]].drop("label", axis=1),
matplotlib=matplotlib
)
return shap_plt
As one may now realize, by explicitly one-hot-encoding the categorical features we essentially split them into different features in their interpretable representation. This can be either good or bad, depending on the actual use case. In this particular aspect, a library such as lightgbm provides the flexibility to choose whether or not to do the one-hot encoding. So the way we want to construct the explanation model may well affect our implementation of the original model.
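If per-dummy attributions are too fragmented, they can be summed back to the original categorical, since shap values are additive. A hypothetical helper (the naming convention "feature_category" matches how pd.get_dummies named our columns; the numbers are illustrative):

```python
import numpy as np
import pandas as pd

def collapse_onehot_shap(shap_row, columns, categorical_attr):
    """Sum per-dummy shap values back into one value per original categorical."""
    out = {}
    for col, v in zip(columns, shap_row):
        # Map e.g. "cp_1.0" back to "cp"; non-dummy columns stay as-is.
        base = next((c for c in categorical_attr if col.startswith(c + "_")), col)
        out[base] = out.get(base, 0.0) + v
    return pd.Series(out)

cols = ["age", "cp_1.0", "cp_2.0", "cp_nan"]
row = np.array([0.3, 0.1, -0.05, 0.0])
collapsed = collapse_onehot_shap(row, cols, ["cp"])
print(collapsed)  # age stays as-is; the three cp dummies merge into one "cp" value
```

Whether the per-category or the per-feature view is more useful depends on the question being asked of the explanation.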
TODO: Use a pre-trained model?
Nori et al. (2019) publish the open source package interpret, a fast implementation of Generalized Additive Models with Pairwise Interactions, or GA2M (Lou et al. (2013)). As of 2019-12-12, interpret is still in its alpha release with limited documentation. The library contains two groups of modeling frameworks:
glassbox: explainable machine learning models
blackbox: machine learning explanation models (such as LIME and SHAP)
We’ve already covered the mainstream approach in the second group, i.e., models that approximate (locally) the original model (supposedly a black box) for better explainability. The more interesting part of interpret is that it brings another type of model that is readily interpretable from its very origin, yet still competitively accurate: the Explainable Boosting Machine, or EBM.
EBM is an additive model of the form:
\[ g(E(y)) = \beta_0 + \sum f_j (x_j) + \sum f_{ij}(x_i, x_j), \]
where \(g(\cdot)\) is a link function (a sigmoid for binary classification, for example), and \(f_j\) is the feature function for the \(j\)-th feature, learned by a gradient boosting machine on only that feature at a time, in a round-robin fashion over all features. \(f_{ij}\) is a pairwise interaction feature function to further boost the accuracy of the model while remaining interpretable.
The model is interpretable since the contribution of any individual feature can be directly quantified by its corresponding feature function \(f_j\). Such an explanation can extend to pairwise interactions if the pairwise feature functions are also estimated.
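A minimal regression sketch of the round-robin boosting idea (squared loss, no pairwise terms; the helper names are made up for illustration and are not interpret's API): each feature gets its own stack of depth-limited trees, fit one at a time on the current residuals, so the accumulated per-feature predictions are exactly the shape functions \(f_j\).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gam_boost(X, y, n_rounds=100, lr=0.1, max_depth=2):
    """Round-robin boosting: one tiny tree per feature per round."""
    n, d = X.shape
    intercept = float(y.mean())
    trees = [[] for _ in range(d)]
    pred = np.full(n, intercept)
    for _ in range(n_rounds):
        for j in range(d):
            t = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
            t.fit(X[:, [j]], y - pred)            # fit residuals on feature j alone
            trees[j].append(t)
            pred += lr * t.predict(X[:, [j]])
    return intercept, trees

def shape_function(trees, j, X, lr=0.1):
    """f_j(x): the accumulated contribution of feature j alone."""
    return lr * sum(t.predict(X[:, [j]]) for t in trees[j])

# Additive ground truth: the model should recover each term separately.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(300, 2))
y = 2.0 * X[:, 0] + np.sin(3.0 * X[:, 1])
intercept, trees = fit_gam_boost(X, y)
pred = intercept + shape_function(trees, 0, X) + shape_function(trees, 1, X)
mse = np.mean((pred - y) ** 2)
print(mse)  # small training error
```

Because the prediction decomposes exactly into the intercept plus the per-feature contributions, explaining any single prediction is just reading off each \(f_j(x_j)\), with no surrogate model needed.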
TODO: How to detect pairwise interaction? Brief the FAST algorithm.
EBM is not efficient for text datasets. Due to the algorithm’s design it runs too long for bag-of-words models, since there are too many feature functions to estimate. If we fit an EBM on the movie review dataset, even though it is not a large dataset, we will encounter OOM (out-of-memory) issues. As a result, we skip the discussion of EBM on a text classifier. (The same restriction applies to image datasets.)
ExplainableBoostingClassifier has a scikit-learn-style API and hence is straightforward to use.
from interpret.glassbox import ExplainableBoostingClassifier
ucihd_ebm = ExplainableBoostingClassifier(
n_estimators=16, feature_names=ucihd_2.columns, n_jobs=1)
_ = ucihd_ebm.fit(ucihd_X_train, ucihd_y_train)
ucihd_ebm_yhat = ucihd_ebm.predict_proba(ucihd_X_test)[:,1]
ucihd_ebm_pred = (ucihd_ebm_yhat > .5).astype(int)
print(classification_report(ucihd_y_test, ucihd_ebm_pred))
The model performs very well on the heart disease dataset, outperforming both RF and GBT.
interpret comes with a rich set of visualization tools (with plotly as its backend). Model explanation is divided into two groups: global and local.
For global explanation, we have access to both global feature importance and a per-feature feature contribution stats.
Name Type # Unique % Non-zero
0 age continuous 40 1.000
1 trestbps continuous 45 1.000
2 chol continuous 129 1.000
3 thalach continuous 83 1.000
4 oldpeak continuous 39 0.679
5 slope continuous 3 1.000
6 ca continuous 5 0.387
7 sex_0.0 categorical 2 0.335
8 sex_1.0 categorical 2 0.665
9 sex_nan categorical 1 0.000
10 cp_1.0 categorical 2 0.080
11 cp_2.0 categorical 2 0.170
12 cp_3.0 categorical 2 0.292
13 cp_4.0 categorical 2 0.458
14 cp_nan categorical 1 0.000
15 fbs_0.0 categorical 2 0.830
16 fbs_1.0 categorical 2 0.170
17 fbs_nan categorical 1 0.000
18 restecg_0.0 categorical 2 0.505
19 restecg_1.0 categorical 2 0.014
20 restecg_2.0 categorical 2 0.481
21 restecg_nan categorical 1 0.000
22 exang_0.0 categorical 2 0.670
23 exang_1.0 categorical 2 0.330
24 exang_nan categorical 1 0.000
25 thal_3.0 categorical 2 0.571
26 thal_6.0 categorical 2 0.071
27 thal_7.0 categorical 2 0.349
28 thal_nan categorical 2 0.009
# Global feature importance.
ucihd_ebm_global.visualize().write_html("/tmp/ucihd_ebm_feat_imp.html", include_plotlyjs=False)
# Global contribution on age.
fid = ucihd_ebm_global.selector.Name.tolist().index("age")
ucihd_ebm_global.visualize(fid).write_html("/tmp/ucihd_ebm_age_imp.html", include_plotlyjs=False)
# Global contribution on trestbps.
fid = ucihd_ebm_global.selector.Name.tolist().index("trestbps")
ucihd_ebm_global.visualize(fid).write_html("/tmp/ucihd_ebm_trestbps_imp.html", include_plotlyjs=False)
# Global contribution on sex.
fid = ucihd_ebm_global.selector.Name.tolist().index("sex_0.0")
ucihd_ebm_global.visualize(fid).write_html("/tmp/ucihd_ebm_sex_imp.html", include_plotlyjs=False)
More importantly, we must be able to explain a specific model prediction locally. This can also be done easily with a couple of lines:
# Explain the same instances previously on RF.
ucihd_exp_ind = np.array([ucihd_rf_tp_idx[0], ucihd_rf_fp_idx[0]])
# We can feed multiple examples at the same time.
ucihd_ebm_local = ucihd_ebm.explain_local(
ucihd_X_test.iloc[ucihd_exp_ind,:], ucihd_y_test[ucihd_exp_ind])
ucihd_ebm_local.visualize(0).write_html("/tmp/ucihd_ebm_exp_tp.html", include_plotlyjs=False)
ucihd_ebm_local.visualize(1).write_html("/tmp/ucihd_ebm_exp_fp.html", include_plotlyjs=False)
For the false positive case made by both RF and GBT, EBM is able to correctly predict the negative label. We still see a positive ca value contributing heavily toward a positive prediction, but EBM also picks up several negative factors that jointly negate that positive impact, ending up with a correct negative prediction.
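Because the model is purely additive, a local explanation is exact: the per-feature scores plus the intercept reproduce the predicted log-odds, so we can see precisely how negative factors offset a large positive ca contribution. A sketch with made-up scores; in a real run these would be read from the fitted model rather than hard-coded:

```python
import math

# Hypothetical per-feature scores for one instance of interest.
intercept = -0.5
contributions = {"ca": 1.2, "oldpeak": 0.3, "thal_3.0": -0.7, "age": -0.4}

# For an additive model the local scores sum exactly to the logit.
logit = intercept + sum(contributions.values())
prob = 1 / (1 + math.exp(-logit))
label = int(prob > 0.5)  # negative prediction despite the large ca score
```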
Throughout the exercises above we only demonstrate a limited number of examples, so nothing conclusive can be said about which model is more reasonable in making its decisions for each problem. With more investigation, however, we may gain more insight into which model can be trusted over the others.
We summarize the benefits of explanation modeling here. In general, it allows us…
Abadi, Martín, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, et al. 2015. “TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems.” http://tensorflow.org/.
Lou, Yin, Rich Caruana, Johannes Gehrke, and Giles Hooker. 2013. “Accurate Intelligible Models with Pairwise Interactions.” In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 623–31. ACM.
Lundberg, Scott M, and Su-In Lee. 2017. “A Unified Approach to Interpreting Model Predictions.” In Advances in Neural Information Processing Systems 30, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, 4765–74. Curran Associates, Inc. http://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions.pdf.
Maas, Andrew L., Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. “Learning Word Vectors for Sentiment Analysis.” In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 142–50. Portland, Oregon, USA: Association for Computational Linguistics. http://www.aclweb.org/anthology/P11-1015.
Nori, Harsha, Samuel Jenkins, Paul Koch, and Rich Caruana. 2019. “InterpretML: A Unified Framework for Machine Learning Interpretability.” arXiv Preprint arXiv:1909.09223.
Pedregosa, F., G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, et al. 2011. “Scikit-Learn: Machine Learning in Python.” Journal of Machine Learning Research 12: 2825–30.
Pennington, Jeffrey, Richard Socher, and Christopher Manning. 2014. “GloVe: Global Vectors for Word Representation.” In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–43.
Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. 2016. “Why Should I Trust You?: Explaining the Predictions of Any Classifier.” In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1135–44. ACM.
Ushey, Kevin, JJ Allaire, and Yuan Tang. 2019. Reticulate: Interface to ’Python’. https://CRAN.R-project.org/package=reticulate.
Some people further differentiate explainability from interpretability, characterizing interpretability as knowing how without knowing why, and explainability as knowing not only how but also why. In this notebook, for simplicity, we do not make such a distinction.↩︎
In many such methods, the simplified input is the indicator of feature presence. One example: Shapley regression values.↩︎
Keras also comes with the dataset preprocessed as integer sequences (from tf.keras.datasets import imdb).↩︎
It will be much faster if we choose xgboost’s or lightgbm’s implementation of random forest. However, to demonstrate compatibility of lime with scikit-learn we purposely choose the slower implementation here.↩︎
Note that, even for a single recurrent layer, training an RNN will be prohibitively slow without a GPU.↩︎
V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.↩︎
xgboost was the first among all the GBT packages to introduce such missing-value treatment; lightgbm follows.↩︎
By default lightgbm calculates the importance by counting how many times a feature contributes to an optimal split during training. It also supports the impurity-based approach with argument importance_type set to "gain".↩︎